ABSTRACT


Diabetes mellitus is a serious chronic disease characterized by the body's
inability to metabolize glucose. It is one of the most dangerous diseases and
a threat to human society: many people become its victims and, despite their
efforts to keep it from progressing, are unable to recover from it. There are
several conventional strategies for monitoring diabetes. In this paper, the
disease is examined using machine learning (ML) algorithms. The goal of this
research is to create an effective model that predicts diabetes with high
precision. The K-nearest neighbor algorithm is used to reduce the processing
time. In addition, a support vector machine is introduced to assign each data
sample to its respective class. Feature selection plays a vital role in
building any ML model; it is the process of selecting, automatically or
manually, the features that contribute most to the desired performance.
Overall, four algorithms are used in this paper to determine which one
predicts most effectively and accurately whether or not a person will suffer
from diabetes.


1. INTRODUCTION

Diabetes mellitus is a condition characterized by a metabolic disorder and an excessive rise
in blood sugar concentration due to a lack of insulin, a peptide hormone secreted by pancreatic islet beta cells
[1], [2]. Diabetes can lead to significant complications, or even premature death. Machine learning (ML)
algorithms are therefore now used to identify and diagnose diseases in order to reduce the mortality rate and
improve a patient's health status, as ML supports specific decisions. There are essentially two main clinical
forms, type 1 diabetes and type 2 diabetes [3], denoted T1D and T2D [4] respectively, which are distinguished
by the origin and progression of the condition, that is, the disorder's etiopathology. Nearly 90% of all
diabetic patients have T2D, which is predominantly characterized by insulin resistance. Lifestyle, physical
activity, dietary patterns and heredity are the main causes of T2D, whereas T1D is believed to be due to the
autoimmune destruction of pancreatic β-cells in the islets of Langerhans.

Researchers have designed multiple diabetes prediction methods based on a variety of
algorithms. A method for classifying diabetes using the support vector machine (SVM) was proposed in [5].
The Pima Indian Diabetes (PID) dataset was used for diagnosis. Using an SVM with the radial basis
function (RBF) kernel as the classifier, an accuracy of 78% was achieved. Orabi et al. [6] designed a
method for the prediction of diabetes. Pradhan and Bamnote [7] present several genetic programming-related
algorithms and perform several tests on this dataset. In [8], a J48 decision tree (DT) (74.8% accuracy) and
naive Bayes (79.5% accuracy) are used as classifiers for diabetes prediction. A prediction model with
two sub-modules was developed in [9] to predict diabetes as a chronic disease.

According to [10], an estimated 250 million individuals are currently affected by diabetes, and this
number will exceed 500 million by 2025. DT is used to extract attributes and features from a fixed dataset [11].
Before testing, the model is trained on the dataset so that it can predict and store the result for each new
instance in a separate class. An algorithm that classifies the risk of diabetes mellitus [13] is presented in
[12]. This paper illustrates diabetes prediction based on the characteristics of the dataset. Gradient boost is
compared with logistic regression (LR), SVM, and k-nearest neighbor (KNN) classifiers. The workflow consists
of dataset selection, extraction of attributes, and implementation of the algorithms after splitting the full
dataset into training and test sets. Finally, the outcome shown in this paper demonstrates the proposed
model's ability to predict diabetes at an early stage in less time.

Section 1 introduces diabetes mellitus, i.e., its causes, symptoms, and types.
Section 2 covers the research method, the model diagram, a brief description of the training and test
datasets, and the algorithms used for predicting the disease: SVM, KNN, LR, and gradient boost. Section 3
describes the dataset, and the outcome figure indicates the percentage of patients with and
without the disease. Section 4 presents the results, where the gradient boost classifier achieves an accuracy
of 81.25%. The last section gives the final conclusion based on our model.


2. RESEARCH METHOD 

The information and steps required to build the model that predicts diabetes using
the classifiers are described in separate sections. The research approach described in this paper explicitly
defines how the entire model works. Figure 1 depicts the procedure of the proposed approach. The steps
involved in the proposed approach are briefly discussed next:

— The first step is the collection of data from Kaggle [14]. The dataset contains a total of 768 instances and 9
attributes [14]. It is briefly discussed in the dataset section.

— The second step is data preprocessing. In this step, null values are checked and removed, and
categorical values are converted into numerical ones.

— In the third step, exploratory data analysis [15] is performed: a correlation matrix of the columns is
computed, and visualizations such as box plots [15] are used to check for outliers.
Other visualizations [16] are used to check how the features are related to the label
column [16]. Feature selection [17], which plays a major role, is also done in this step: the important
features are selected, and the data is then fit to the model for
prediction [17].

— In the fourth step, the dataset is divided into training and testing sets [18] with a test size of 0.25,
i.e., the training data [19] consists of 75% of the whole dataset and the testing data contains the
remaining 25%. The dataset has a total of 768 instances and 9 attributes, so the training data
contains 576 instances (75% of 768) and the testing data contains 192 instances (25% of 768).

— Algorithms used: four algorithms are used in this paper, namely LR, KNN, SVM, and gradient
boost. These four algorithms are discussed next.
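
The 75/25 split described in the fourth step can be sketched in plain Python as follows. This is a minimal illustration rather than the code used in the paper: the function name `train_test_split_indices` and the seed are hypothetical, and only the row count (768) and the test size (0.25) come from the text.

```python
import random


def train_test_split_indices(n_rows, test_size=0.25, seed=0):
    """Shuffle row indices and split them into training and testing parts.

    `seed` is an arbitrary choice for reproducibility (an assumption,
    not a value taken from the paper).
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(768, test_size=0.25)
print(len(train_idx), len(test_idx))  # 576 192
```

Shuffling before splitting keeps the two parts from inheriting any ordering present in the original file.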


Figure 1. Proposed approach model diagram (collection of data, data preprocessing, exploratory data analysis, train-test split, algorithms used: LR, KNN, SVM, gradient boost)


Prediction of diabetes disease using machine learning algorithms (Monalisa Panda) 


ISSN: 2252-8938


2.1. Logistic regression (LR) 

Several algorithms are evaluated to decide which is the best match for this dataset and provides the best
results: logistic regression, KNN, SVM, and gradient boost. LR is used to solve
classification problems; it fits the points with an S-shaped (sigmoid) curve instead of a straight line [20].
The name logistic comes from the logit function used in this method of classification [20].
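
As a small illustration of the S-curve mentioned above, the sketch below implements the sigmoid function and its inverse, the logit. The weights `w`, the bias `b`, and the sample feature vector are hypothetical values chosen only for demonstration; they are not fitted to the diabetes dataset.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid: the log-odds of probability p."""
    return math.log(p / (1.0 - p))


def predict_proba(x, w, b):
    """Probability of the positive class for feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical weights and input, for illustration only.
w, b = [0.8, -0.5], 0.1
p = predict_proba([1.0, 2.0], w, b)
label = 1 if p >= 0.5 else 0  # threshold the probability at 0.5
print(round(p, 4), label)
```

A probability below 0.5 maps to class 0 and a probability of 0.5 or above maps to class 1; the 0.5 threshold is a common default rather than something stated in the paper.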


2.2. K-nearest neighbor (KNN) 

KNN has two properties: i) it is a lazy learning algorithm, since there is no separate training step and
all the training data is used during classification; and ii) it is a non-parametric learning algorithm, since
it does not assume anything about the underlying data. First, the dataset is fed as input, including the data
for testing and training. The dataset is loaded and preprocessed in the next step. After that,
the KNN algorithm is applied to get the desired results. The steps involved in this algorithm are:
— Using Euclidean, Manhattan or Hamming distance, the distance between the test point and each row of
the training data is measured.
— The rows are arranged in ascending order of distance.
— The top k rows [21] are picked from the sorted list.
— The class is assigned to the test point based on the most frequent class among these rows [21].
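
The four steps above can be sketched in a few lines of Python. This is a minimal illustration using Euclidean distance; the toy training points, the labels, and the choice `k=3` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    # Step 1: distance from the query point to every training row.
    distances = [(euclidean(row, query), label)
                 for row, label in zip(train_X, train_y)]
    # Step 2: sort in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: keep the labels of the k nearest rows.
    nearest = [label for _, label in distances[:k]]
    # Step 4: assign the most frequent class among them.
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train_X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
           [6.0, 6.0], [7.0, 6.5], [6.5, 7.0]]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [1.2, 1.4]))  # 0
print(knn_predict(train_X, train_y, [6.4, 6.2]))  # 1
```

Because every prediction scans the full training set, the cost of a query grows with the number of training rows.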


2.3. Support vector machine (SVM) 

Support vectors are the most important data points of the training dataset: they determine the dividing
hyperplane between the two non-overlapping classes, so if these points are removed, the position of the
hyperplane changes [22]. In addition, nonlinear separation can work better than linear separation. The steps
involved in this algorithm are: i) import the dataset; ii) explore the data to figure out what they look like
[23]; iii) split the data into attributes and labels [23]; iv) divide the data into training and testing
datasets; and v) train the SVM algorithm to obtain the desired output.
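
To make the training step concrete, the following sketch trains a simplified linear SVM by sub-gradient descent on the hinge loss. It is an illustration under stated assumptions, not the classifier used in the paper: the toy data, the learning rate `lr`, the regularization strength `lam`, and the number of epochs are all hypothetical, and a kernel SVM (such as the RBF kernel mentioned in the introduction) would need a different formulation.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=100):
    """Fit weights w and bias b by sub-gradient descent on the
    L2-regularized hinge loss. Labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: the hinge loss is active.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only regularization shrinks w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [3.0, 2.0],
     [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0], [-3.0, -2.0]]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

Only the points that violate the margin change the direction of `w`, which mirrors the role of support vectors described above.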

The last step is comparing the performance [24] of all the algorithms to
get the results. In this performance evaluation phase, the models are evaluated using the confusion matrix
and the classification report, from which the accuracy, precision, recall, and F1-score of each algorithm
are calculated. The comparison between all algorithms is discussed in the results section. The output of a
classification model is represented using a confusion matrix, which can be written as,


\[
\text{Confusion Matrix} =
\begin{bmatrix}
TP & FP \\
FN & TN
\end{bmatrix}
\tag{1}
\]

True positives (TP): cases in which the classifier predicted TRUE (the patient has the disease) and TRUE was
the correct class (the patient has the disease). True negatives (TN): cases in which the model predicted FALSE
(no disease) and FALSE was the correct class (the patient does not have the disease). False positives (FP)
(type I error): cases in which the classifier predicted TRUE, but FALSE was the correct class (the patient did
not have the disease). False negatives (FN) (type II error): cases in which the model predicted FALSE (the
patient has no disease), but the patient actually has the disease.
Key performance indicators (KPI) are calculated as follows:
— Accuracy of classification=(TP+TN)/(TP+TN+FP+FN)
— Misclassification rate (error rate)=(FP+FN)/(TP+TN+FP+FN)
— Precision=TP/Total TRUE predictions=TP/(TP+FP) (how often was the model correct when it
predicted the TRUE class?)
— Recall=TP/Actual TRUE=TP/(TP+FN) (how often did the classifier get it right when the class was
actually TRUE?)
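
The indicators above can be computed directly from the four confusion-matrix counts, as in the sketch below. The counts are illustrative and are not reported in the paper; they were chosen so that, for a 192-sample test set, the results come out close to the logistic regression row of Table 1. The F1-score, which the list above does not define, is taken here as the harmonic mean of precision and recall.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the key performance indicators from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Illustrative counts (an assumption, not taken from the paper); they sum to 192.
metrics = classification_metrics(tp=36, tn=119, fp=11, fn=26)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and error rate always sum to 1, so reporting either one determines the other.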


3. DATASET DESCRIPTION 

This dataset has been taken from Kaggle [14]. It consists of 768 rows and 9 columns, with
8 columns being predictor variables and 1 column being the target variable (output). The target variable is
outcome, while the predictor variables include the patient's number of births, triceps skin fold thickness (mm),
body mass index (BMI) (weight in kg/(height in m)^2), diastolic blood pressure (mm Hg), insulin level,
age, and so on. Figure 2 depicts the instances of the outcome column. The outcome column is the target variable
and contains zeros and ones, where zero represents that the patient does not have diabetes and one
represents that the patient has diabetes. From Figure 2, we can say that 500 instances in the
outcome column are 0 and 268 are 1, that is, around 65.10% do not have diabetes and around 34.90% have the
disease in the dataset.
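
The class percentages follow directly from the two counts, as the short check below shows. Only the counts 500 and 268 come from the text; the variable names are illustrative.

```python
# Outcome counts as stated in the text: 0 = no diabetes, 1 = diabetes.
counts = {"No": 500, "Yes": 268}
total = sum(counts.values())

shares = {label: 100 * n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.2f}%")
# No: 65.10%
# Yes: 34.90%
```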


Figure 2. Instances of outcome (No: 65.10%, Yes: 34.90%)


4. RESULTS AND DISCUSSION 

The final results were obtained using the four machine learning algorithms mentioned
above. Table 1 shows the accuracy, precision, recall and F1-score of all the models used in this paper.
Comparing the algorithms, the best one in terms of accuracy is gradient boost, with an accuracy of 81.25%.

Figure 3 depicts the accuracy comparison among the models used for this work. From this figure, it
can be concluded that the gradient boost algorithm gives the best accuracy of around 81.25% and KNN gives the
lowest accuracy of 78% among these four algorithms; in addition, logistic regression gives 81%
whereas the support vector classifier gives 80% accuracy.


Table 1. Comparison among the models 


Name                  Accuracy  Precision  Recall  F1-score
Gradient Boost        0.8125    0.7600     0.6290  0.6964
Logistic Regression   0.8073    0.7660     0.5806  0.6605
SVM                   0.8021    0.7400     0.5968  0.6607
KNN                   0.7813    0.7273     0.5161  0.6038


Figure 3. Model accuracy comparison (GradientBoostingClassifier: 0.81, LogisticRegression: 0.81, SVC: 0.8, KNeighborsClassifier: 0.78)


Figure 4 depicts the precision comparison among the models used for this work. From this figure, it
can be concluded that LR [25] gives the best precision score of around 0.77 and the k-neighbors
classifier gives the lowest precision score of 0.73 among these four algorithms; in addition, the gradient
boost algorithm gives 0.76 whereas the support vector classifier gives 0.74. Figure 5 depicts the recall
comparison among the models used for this paper. From this figure, it can be concluded that the gradient boost
algorithm gives the best recall score of around 0.63 and the k-neighbors classifier [26] gives the lowest recall
score of 0.52 among these four algorithms; in addition, logistic regression gives 0.58 whereas the support
vector classifier gives 0.6.

Figure 6 depicts the F1-score comparison among the models used for this work. From this figure, it
can be concluded that the gradient boost algorithm gives the best F1-score [27] of around 0.7 and the k-neighbors
classifier gives the lowest F1-score of 0.6 among these four algorithms; in addition, logistic regression
and the support vector classifier give the same F1-score of 0.66. From the simulation results, it can be
concluded that, of the four algorithms used in this paper (LR [28], KNN [29], SVM, and gradient boost [30]),
the best accuracy of 81.25% is achieved by the gradient boost classifier.

Figure 4. Model precision comparison

Figure 5. Model recall comparison

Figure 6. Model F1-score comparison


5. CONCLUSION 

With sophisticated statistical techniques and the availability of a large number of epidemiological
and genetic diabetes risk datasets, ML has considerable potential to reshape diabetes risk
prediction. This paper shows that the gradient boosting classifier works well for this type
of dataset, which is confirmed by the model's accuracy and recall. KNN works well when the dataset
contains a large number of samples, as it helps to minimize processing time, and SVM handles a dataset with
a large number of features well. In future work, this model can be used in an application
that takes patients' past health records and shows whether or not the person has diabetes.